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Traditional language processing tools constrain language designers to specific kinds of grammars. 
In contrast, model-based language specification decouples language design from language pro- 
cessing. As a consequence, model-based language specification tools need general parsers able to 
parse unrestricted context-free grammars. As languages specified following this approach may be 
ambiguous, parsers must deal with ambiguities. Model-based language specification also allows 
the definition of associativity, precedence, and custom constraints. Therefore parsers generated 
by model-driven language specification tools need to enforce constraints. In this paper, we pro- 
pose Fence, an efficient bottom-up chart parser with lexical and syntactic ambiguity support that 
allows the specification of constraints and, therefore, enables the use of model-based language 
specification in practice. 



I. INTRODUCTION 

Traditional language specification techniques Q re- 
quire the developer to provide a textual specification of 
the language grammar. 

In contrast, model-based language specification tech- 
niques [131 allow the specification of languages by means 
of data models annotated with constraints. 

Model-based language specification has direct appli- 
cations in the following fields: programming tools [1|, 
domain-specific laiiguages [ol. ITol. lT6l|. model-driven soft- 
ware development [2l| , data integration , text mining 
d, [13] , natural language processing [ll| , and the corpus- 
based induction of models p^ . 

Due to the nature of the aforementioned application 
fields, the specification of separate language elements 
may cause lexical and syntactic ambiguities. Lexical am- 
biguities occur when an input string simultaneously cor- 
responds to several token sequences (TtI , which may also 
overlap. Syntactic ambiguities occur when a token se- 
quence can be parsed in several ways. 

The formal grammars of languages specified using 
model-based techniques may contain epsilon productions 
(such as E := e), infinitely recursive production sets 
(such a.s A :— c, A :— B, and B := A), and associa- 
tivity, precedence, and custom constraints. Therefore, a 
parser that supports such specification is needed. 

Our proposed algorithm, Fence, is a bottom-up chart 
parser that accepts a lexical analysis graph as input, per- 
forms an efficient syntactic analysis taking constraints 
into account, and produces a parse graph that represents 
all the possible parse trees. The parsing process discards 
any sequence of tokens that does not provide a valid syn- 
tactic sentence conforming to the language specification, 
which consists of a production set and a set of constraints. 
Fence implicitly performs a context-sensitive lexical anal- 
ysis, as the parsing process determines which token se- 
quences end up in the parse graph. Fence supports ev- 
ery possible construction in a context-free language with 
constraints, including epsilon productions and infinitely 



recursive production sets. 

The combined use of the Lamb lexical analyzer [l^ 
and Fence allows the generation of processors for lan- 
guages with ambiguities and constraints, and it renders 
model-based language specification techniques feasible. 
Indeed, ModelCC (201] is a model-based language speci- 
fication tool that relies on Lamb and Fence to generate 
language processors. 

II. BACKGROUND 

Language processing tools traditionally divide the 
analysis into two separate phases; namely, scanning (or 
lexical analysis), which is performed by lexers, and pars- 
ing (or syntax analysis), which is performed by parsers. 
However, language processing tools based on scannerless 
parsers also exist. 

A. Lexical Analysis Algorithms with Ambiguity Support 

Given a language specification describing the tokens 
listed in Figure [H the string "&5.2& /25.20/" can cor- 
respond to the four different lexical analysis alternatives 
shown in Figure [21 depending on whether the sequences 
of digits separated by points are considered real numbers 
or integer numbers separated by points. 

The productions shown in Figure[3] illustrate a scenario 
of lexical ambiguity sensitivity. Sequences of digits sepa- 
rated by points should be considered either Real tokens 
or Integer Point Integer token sequences depending on 

(-|\+)?[0-9]+ Integer 

(- I \+) ? [0-9] +\ . [0-9] + Real 

\ . Point 

\/ Slash 

\& Ampersand 

Figure 1 Specification of token types as regular expressions 
for a lexically-ambiguous language. 
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• Ampersand Integer Point Integer Ampersand Slash 
Integer Point Integer Slash 

• Ampersand Integer Point Integer Ampersand Slash 
Real Slash 

• Ampersand Real Ampersand Slash Integer Point 
Integer Slash 

• Ampersand Real Ampersand Slash Real Slash 

Figure 2 Different possible token sequences in the input string 
"&5.2& /25.20/" due to the lexically-ambiguous language 
specification shown in Figure [1] 

E : := A B 

A ::= Ampersand Real Ampersand 

B ::= Slash Integer Point Integer Slash 

Figure 3 Context-sensitive productions that resolve the am- 
biguities in Figure O 

the surrounding tokens, which may be either Ampersand 
tokens or Slash tokens. The desired result of analyzing 
the input string "&5.2& /25.20/" is shown in Figure H 

The further application of a parser supporting lexical 
ambiguities would produce the only possible valid sen- 
tence, which, in turn, would be based on the only valid 
lexical analysis for our example. The intended results are 
shown in Figure [SI 

The Lamb lexical analyzer [l^ captures all possible 
sequences of tokens within a given input string and it 
generates a lexical analysis graph that describes them 
all, as shown in Figure [5j In these graphs, each token is 
linked to its preceding and following tokens. There may 
also be several starting tokens. Each path in these graphs 
describes a possible sequence of tokens that can be found 
within the input string. 

To the best of our knowledge, the only way to process 
lexical analysis graphs consists of extracting the different 
paths from the graph and parse each of them. This pro- 
cess is inefficient, as partial parsing trees that are shared 
among different token sequences have to be created sev- 
eral times. 



B. Syntactic Analysis Algorithms 

Traditional efficient parsers for restricted context-free 
grammars, such as the LL LR [m], LALR [1,01, and 
SLR [6| parsers, do not consider ambiguities in syntactic 
analysis, so they cannot be used to parse ambiguous lan- 
guages. The efficiency of these parsers is 0(n), being n 
the token sequence length. 

Generalized LR (GLR) parsers parse in linear to 
cubic time, depending on how closely the grammar con- 
forms to the underlying LR strategy. The time required 
to run the algorithm is proportional to the degree of non- 
determinism in the grammar. The Universal parser (23j 
is a GLR parser used for natural language processing. 
However, it fails for grammars with epsilon productions 



and infinitely recursive production sets. 

Existing chart parsers for unrestricted context-free 
grammar parsing, as the CYK parser [l^, HI] and the 
Earley parser , can consider syntactic ambiguities but 
not lexical ambiguities. The efficiency of these general 
context-free grammar parsers is 0{n^), being n the to- 
ken sequence length. 

III. FENCE 

In this paper, we introduce Fence, an efficient bottom- 
up chart parser that produces a parse graph that contains 
as many root nodes as different parse trees exist for a 
given ambiguous input string. 

In contrast to the parsing techniques mentioned in the 
previous section. Fence is able to process lexical analy- 
sis graphs and, therefore, it efficiently considers lexical 
ambiguities. 

Fence also considers syntactic ambiguities, allows the 
specification of constraints, and supports every possible 
context-free language construction, particularly epsilon 
productions and infinitely recursive production sets. 

The Fence parsing algorithm consists of three consecu- 
tive phases: the extended lexical analysis graph construc- 
tion phase, the chart parsing phase, and the constraint 
enforcement phase. 

A. Terminology 

A context-free grammar is formally defined 31 as the 
tuple (iV,i;,F,S'), where: 

• iV is the finite set of nonterminal symbols of the lan- 
guage, sometimes called syntactic variables, none of 
which appear in the language strings. 

• E is the finite set of terminal symbols of the lan- 
guage, also called tokens, which constitute the lan- 
guage alphabet (i.e. they appear in the language 
strings). Therefore, E is disjoint from TV. 

• P is a finite set of productions, each one of the form 
N — > (EUiV)*, where * is the Kleene star operator, 
U denotes set union, the part before the arrow is 
called the left-hand side (LHS) of the production, 
and the part after the arrow is called the right-hand 
side (RHS) of the production, after the arrow is 
called the right-hand side of the production. 

• S* is a distinguished nonterminal symbol, S E N: 
the grammar start symbol. 

A dotted production is of the form N (I]UiV)*.(I]U 
N)* , where the dot indicates that the RHS symbols be- 
fore the dot have already been matched with a substring 
of the input string. 

A handle is a tuple {dottedproduction, [start, end]), 
where start and end identify the substring of the input 
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Figure 4 Desired lexical analysis of the lexically ambiguous "&5.2& /25.20/" input string. 




Figure 5 Lexical analysis graph, as produced by the Lamb lexer. 




Figure 6 Syntactic analysis graph, as produced by applying a parser that supports lexical ambiguities to the lexical analysis 
graph shown in Figure [5] Squares represent nonterminal symbols found during the parsing process. 




Figure 7 Extended lexical analysis graph corresponding to the lexical analysis graph shown in Figure [5l Gray nodes represent 
cores. 
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string that matched the dotted production RHS sym- 
bols before the dot. Each handle can be used during 
the parsing process to match a rule RHS symbol with a 
node representing either a token or a nonterminal symbol 
(namely, SHIFT actions in LR-like parsers) or perform a 
reduction (namely, REDUCE actions in LR-like parsers) . 
A core is a set of handles. 



B. Extended Lexical Analysis Graph Construction Phase 

In order to efBciently perform the parsing process. 
Fence converts the input lexical analysis graph (LA 
graph) into an extended lexical analysis graph (ELA 
graph) that stores information about partially ap- 
plied productions (namely, handles) in data structures 
(namely, cores). 

In an ELA graph, tokens are not linked to their pre- 
ceding and following tokens, but to their preceding and 
following cores. Cores are, in turn, linked to their pre- 
ceding and following token sets. For example, the ELA 
graph corresponding to the LA graph in Figure[5]is shown 
in Figure [7l 

The conversion from the LA graph to the ELA graph 
is performed by completing the LA graph with cores. 
A starting core is linked to the tokens with an empty 
preceding token set. A last core is linked from the tokens 
with an empty following token set. Finally, for each one 
of the other tokens in the LA graph, a preceding core is 
linked to it. Links between tokens in the LA graph are 
converted into links from tokens to the cores preceding 
each token of their following token set in the ELA graph. 



C. Chart Parsing Phase 

The Fence chart parsing phase processes the ELA 
graph and generates an implicit parse graph (I-graph). 
Nodes in the I-graph are described as {start, end, symbol) 
tuples, where start and end identify the substring of the 
input string, and symbol identifies the production LHS. 
It should be noted that ambiguities, both lexical and syn- 
tactic, are implicit in the I-graph nodes, as they contain 
no information about their contents. The I-graph con- 
tains a set of starting nodes, each of which may represent 
several parse tree roots. The parsing itself is performed 
by progressively applying productions and storing han- 
dles in cores. 

The grammar productions with an empty RHS (i.e. 
epsilon productions) are removed from the grammar and 
their LHS symbol is stored in the epsilonSymbols set. 
This set allows these parse symbols being skipped when 
found in a production, as if a reduction using the epsilon 
production were applied. 

The agenda is a stack of (handle, node) in which the 
node can match the symbol after the dot in the dotted 
rule of the handle. It is initially empty. 

The already Generated handle set contains all the 



agenda entries ever generated and inhibits the genera- 
tion of duplicate entries. 

The parser is initialized by generating a handle for each 
production and adding them to every core, as shown in 
Figure [lOl 

The addHandle procedure in FigurelHlis responsible for 
adding a handle to a core. It also adds the correspond- 
ing agenda entries for that handle with the nodes that 
follow the core and match the symbol after the dot in 
the dotted production of the handle. It should be noted 
that the addHandle procedure considers epsilon produc- 
tions: if a production RHS symbol is in the epsilonSym- 
bols set, both the possibilities of it being reduced or not 
by that production are considered; that is, a new han- 
dle that skips that element is added to the same core. It 
should also be noted that element are skipped iteratively, 
as many consecutive RHS symbols of a production could 
be in the epsilonSymbols set. 

procedure addHandle (Production p, int matched, 

ImplicitNode first , ImplicitNode n, 
Stack< [Handle , ImplicitNode] > agenda) : 

offset = 
do: 

next = matched+of f set 

nextSymbol = p . right [next] . symbol 

h = new HandleCp, next , first , first . startlndex) 

if ! n . core . contains (h) : 

n. core . add(h) 
if n. symbol == nextSymbol: 

if ! alreadyGenerated. contains ( [h,n] ) : 

agenda . push ( [h , n] ) 

alreadyGenerated. add(]h,n] ) 
off set++ 

while epsilonSymbols . contains (nextSymbol) && 
next<r . right . size 

Figure 9 Pseudocode of the ancillary addHandle procedure. 



agenda = {} 

for each Production p in productionSet : 
for each ImplicitNode n in nodeSet : 
addHandle (p , , n , n , agenda) 

Figure 10 Pseudocode of the chart parser initialization. 

The parsing process consists in iteratively extracting 
entries consisting of handles and nodes from the agenda 
and matching the next symbol of the RHS of the han- 
dle production with the node. The handles whose pro- 
ductions are successfully matched are added to the cores 
following the node and the agenda is updated with the 
entries that contain any of the newly generated handles. 
In case all the symbols of a production RHS match a 
sequence of nodes, a new node is generated by reduc- 
ing them. The new node start index is obtained from 
the handle, its end position is obtained from the last 
node matched, and its symbol is the LHS symbol of the 
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while ! agenda . empty : 
[h,n] = agenda. pop 

if h. dotposition == h. production. right . size-1 : 
// Production matched all its elements. 
// i.e. Reduction 

nn = new ImplicitNode(h. startlndex, 
n . endlndex , 
p . left . symbol) 
h. first . core .following. add (nn) 
nn. preceding. add (h. first . core) 
for each Core c in n. following: 
c .preceding. add (nn) 
nn. following. add(c) 
for each Heoidle hn in 

h. first . core . waitingFor (nn. symbol) : 
hadd = new Handle (hn. production, hn. next , 

hn. first ,hn. startlndex) 
agenda . push ( [hadd , n] ) 
else : 

// i.e. Shift 

for each Core c in n. following: 

for each ImplicitNode nnext in c. following: 
addHandle (h. product ion, h.next+1 ,h. first , 
h . St art Index , agenda) 

Figure 11 Pseudocode of the Fence parsing phase. 



production. When a newly generated node only has the 
starting core in its preceding core set and the final core in 
its following core set, and its symbol corresponds to the 
initial symbol of the grammar, it is added to the parse 
graph starting node set, which means that that node rep- 
resents a valid parse. The pseudocode for this process is 
shown in Figure [TTl 

The result of the chart parsing phase is an I-graph, 
which the constraint enforcement phase accepts as input. 

The Fence chart parsing phase order of efficiency is 
theoretically equivalent to existing Earley chart parsers. 
That is, O(n^) in the general case, O(n^) for unambigu- 
ous grammars, and 0{n) for almost all LR(k) grammars, 
being n the length of the input string. 



D. Constraint Enforcement Phase 

The Fence constraint enforcement phase processes the 
I-graph and generates an explicit parse graph (E-graph, 
or just parse graph) by enforcing the constraints de- 
fined for the language. Nodes in the E-graph that rep- 
resent tokens are still defined as (start, end, symbol) tu- 
ples. Nodes in the E-graph that represent nonterminal 
symbols reference the list of nodes that matched the pro- 
duction used to generate those nodes. It should be noted 
that ambiguities, both lexical and syntactic, are explicit 
in the E-graph, as it represents several parse trees cor- 
responding to all the possible interpretations of the in- 
put string. The E-graph contains a set of starting nodes, 
each of which represents a parse tree root. Constraint en- 
forcement is performed by converting each implicit node 



into every possible explicit node sequence that can be 
derived from the implicit node and satisfies the specified 
constraints; that is, by expanding the each implicit node. 

Only the nodes that conform valid parse trees are 
needed in the parse graph. In order to generate only 
these nodes, each one of the implicit nodes in the start- 
ing node set of the I-graph is recursively expanded using 
memoization. Each possible resulting explicit node is the 
root of a parse tree in the E-graph. 



1. Algorithm Description 

The expansion of an implicit node is performed by 
finding every possible reduction of a sequence of explicit 
nodes that generates that node. Each one of these reduc- 
tions produces an explicit node. Whenever an implicit 
node is found and needed in order to make the reduc- 
tions progress, it is expanded recursively. It should be 
noted that this procedure is different from parsing itself 
in that the actual bounds of the reductions for every node 
are known. 

The expand procedure in Figure[T2]expands an implicit 
node by applying every possible production that could 
generate it and produces a set of explicit nodes. The 
use of the history set inhibits entering an infinite loop 
when processing infinitely recursive production sets, as it 
avoids the expansion of a node as an indirect requirement 
of expanding the same node. 

The apply procedure in Figure [T3] applies a production 
by matching the RHS symbol given by the matched + 1 
index of it with the n node, expanding the nodes that 
follows it, and recursively applying the next RHS symbols 
of the production. 

The checkConstraints procedure is the responsible for 
the enforcement of the constraints specified by the devel- 
oper. 



2. Supported Constraints 

Fence supports associativity constraints, selection 
precedence constraints, composition precedence con- 
straints, and custom-designed constraints. 

The fact that the constraint check is performed during 
the graph expansion improves the parser performance, 
as the sooner constraints are applied, the more inter- 
pretations are discarded. For example, in the case of 
a binary expression with left-to-right associative opera- 
tors, the string "2+5+3-f'5-K6-f 2-f 1-I-5-K6-I-3" can be ex- 
panded in 10! possible ways without considering the as- 
sociativity constraint, and in just 1 possible way when 
considering it. 

• Associativity constraints allow the specification 
of the associative property for binary operators. 
The application of a production is inhibited when 
one of the nodes that matches its RHS symbols has 
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procedure expandClmplicitNode n, 

Set<ImplicitNode> history, 
Map<ImplicitNode , 
Set<Node» alreadyExpanded) 
returns Set<Symbol> : 
if alreadyExpanded. contains (n) : // memoization 

return alreadyExpanded . get (n) 
else: 

// the history set avoids infinite loop in 

// recursive production sets 
if Ihistory. contains (n) : 
history . add(n) ; 

// try to apply every production 
for each Production p with 

LHS symbol == n.sjnnbol: 
for every ImplicitNode pn with 

startlndex == n. startlndex : 
if pn ! = n && pn . endlndex<=n . endlndex : 
if p.mayMatchCpn. symbol) : 

// apply production p to each 

// expanded symbol of pn 

pn . expandeds = expandCpn, history, 

alreadyExpanded) 
for each Node nn in pn . expandeds : 
out += apply (p,nn,0,{}-, 

alreadyExpanded , history) 
alreadyExpanded . put (n , out ) 
return out 

Figure 12 Pseudocode of the expand procedure that obtains 
every possible derivation of a given node in the parse graph. 

an associativity constraint and is followed (for left- 
to-right associativity constraints), preceded (for 
right-to- left associativity constraints) , or either fol- 
lowed or preceded (for non-associative associativity 
constraints) by a node that was derived using the 
same production. 

• Selection precedence constraints allow the res- 
olution of syntactic ambiguities caused by differ- 
ent explicit nodes (i.e. interpretations) resulting 
from a single implicit node. For example, a State- 
ment can be either an OutputStatement or a Func- 
tionCall. Both OutputStatement and FunctionCall 
can match the input string "output (var) ;" , there- 
fore OutputStatement can be set to precede Func- 
tionCall, which will inhibit that string from being 
considered a function call. The application of a 
production is inhibited when it is preceded by a 
different production and both of them match the 
same node sequence. 

• Composition precedence constraints allow the 
resolution of syntactic ambiguities when a node de- 
rived using a production cannot be derived using 
another production. For example, one of the pro- 
ductions ConditionalStatement ::= "if" Expression 
Sentence and ConditionalStatement ::= "if" Ex- 
pression Sentence "else" Sentence can be set to 



procedure applyCProduction p. Node n, int matched, 
List<Node> content, 

Map<ImplicitNode , Set<Node» alreadyExpanded, 
Set<ImplicitNode> history) 
returns Set<Node> : 
if matched == p.right . size : 

n = new Node(p. symbol, p, content) 
if checkConstraints(n) : 
return -[nj 
else : 

offset = 

next = matched+of f set 
do: 

if p . right [next] . symbol == n . symbol : 
for each ImplicitNode pn in 

n.followingNodesO : 
if pn is the next symbol to match 

in the production: 
// keep applying production to each 
// expanded S3niibol of pn 
expandeds = expand(pn, history, 

alreadyExpanded) 
for each Node nn in expandeds : 

out += apply(p,nn,next+l , content +n, 

alreadyExpanded, history) 

off set++ 

next = matched+of f set 
while epsilonSymbols . contains (nextSymbol) && 
next<r . right . size && 
p . right [next] . symbol == n . symbol 

return out 

Figure 13 Pseudocode of the ancillary apply procedure that 
applies a production. 



precede the other one in order to resolve the ambi- 
guity in "if exprl if expr2 scntl else sent2" , in which 
"else scnt2" could be assigned to either the inside or 
outside conditional sentence. The application of a 
production is inhibited when it precedes any of the 
productions used to derive the nodes that matched 
its RHS symbols. 

• Custom-designed constraints allow the specifi- 
cation of any other constraints (e.g. semantic con- 
straints) . In order to enforce custom-designed con- 
straints, an evaluator can be assigned to any pro- 
duction. Whenever a node is generated, the eval- 
uator of the production used to derive it gets ex- 
ecuted and determines whether the node satisfies 
the constraint or not. In the later case, its gen- 
eration is inhibited. Custom-designed constraints 
provide a very extensible framework which allows 
developers to design complex syntactic or semantic 
constraints (e.g. probabilistic constraints, corpus- 
based constraints) that effectively limit the possi- 
ble interpretations of an input string and, as a side 
effect, improve the performance of the parser, as 
pruned partial interpretations are discarded as soon 
as they do not fulfill the constraints. 



7 



The result of the constraint enforcement step is an E- 
graph or parse graph, such as the one shown in Figure 

m 

The Fence constraint enforcement phase improves the 
performance of traditional techniques phases in practice, 
as all constraints are applied at the earliest possible time, 
thus discarding possibilities that would otherwise be pro- 
cessed later. 



IV. CONCLUSIONS AND FUTURE WORK 

We have presented Fence, an efficient bottom-up chart 
parsing algorithm with lexical and syntactic ambiguity 
support. Its constraint-based ambiguity resolution mech- 
anism enables the use of model-based language specifi- 
cation in practice. In fact, the ModelCC model-based 
language specification tool [20] generates Fence parsers. 

Fence accepts a lexical analysis graph as input, per- 
forms syntactic analysis conforming to a formal context- 
free grammar specification and a set of constraints, and 
produces as output a compact representation of the set 
of parse trees accepted by the language. 

Fence applies constraints while expanding the parse 
graph. Thus, it improves the performance of traditional 
techniques in practice, as the sooner constraints are ap- 
plied, the less processing time and memory the parser 
will require. 

In the future, we plan to apply the ModelCC model- 
based language specification tool, which relies on Fence, 
to the implementation of programming tools, model- 
driven software development, data integration, corpus- 
based induction of models, text mining, and natural lan- 
guage processing. 
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